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Abstract 

Penalized generalized estimating equations with Elastic Net or L2-Smoothly 
Clipped Absolute Deviation penalization are proposed to simultaneously select 
the most important variables and estimate their effects for longitudinal Gaus- 
sian data when multicollinearity is present. The method is able to consistently 
select and estimate the main effects even when strong correlations are present. 
In addition, the potential pitfall of time-dependent covariates is clarified. Both 
asymptotic theory and simulation results reveal the effectiveness of penalization 
as a data mining tool for longitudinal data, especially when a large number of 
variables is present. The method is illustrated by mining for the main determi- 
nants of life expectancy in Europe. 

Keywords: Covariate selection, Generalized estimating equations, 
Longitudinal data, Multicollinearity, Penalization, Time-dependent covariates 



1. Introduction 
D 

Longitudinal data appear frequently in biomedical applications. Researchers 
are often confronted with the problem of determining the impact of different 
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covariates on a response. Correct inference can be obtained by building an 
appropriate longitudinal model. Molenberghs and Verbeke (2005) distinguish 



three types of model families: marginal models, conditional models and subject- 
specific models. After the choice of the model family, an optimal set of predictors 
has to be selected. This can be a tedious task due to a large number of potential 
covariates. Including irrelevant covariates leads to inefficient inference. There- 
fore covariate selection is an important part of longitudinal model building, 
which is the main focus of this paper. 

Variable selection in both the mixed model as a subject-specific model and 
generalized estimating equations as a marginal model will be briefly reviewed 
before turning to penalization methods within the generalized estimating equa- 
tions framework. 

Within the mixed model framework, Wu (2009) advised using significance 
testing or information criteria such as the Akaike information criterion or the 
Bayesian information criterion for the selection of fixed effects. Information cri- 



and Liu 


(2004 


) and 


Vaida and Blanchard 


(2005 


1. 


Liu et al. 


(1999) 



the idea of cross-validation to mixed models. When the number of covariates be- 
comes large, employing a stepwise search can reduce the computational burden 
for the selection techniques above. Recently, Jiang et al. (2008) have suggested 
fence methods to put up a barrier between correct and incorrect mixed models. 



In this paper we choose generalized estimating equations (GEE, |Liang and 
Zeger| 1986 ) as our inference framework instead of generalized linear mixed 



models (GLMM). The GEE approach yields population averaged effects by only 
specifying the first two moments of the outcome distribution. Its robustness 
against variance structure misspecification makes the GEE method well suited 
for our purpose of mean structure selection. Additionally, the problem of time- 
dependent covariates can be more easily clarified in the GEE context. The 
results provided in this paper are generalizable to linear mixed models as well, 
however this is not addressed here. 

Despite the focus of GEE on mean structure estimation, appropriate covari- 
ate selection techniques are not well developed in this context. The standard 
practice for GEE model building is stepwise selection based on Wald-type tests 
(see for instance Diggle et al. , 2002 ) . More recently some general variable se- 



lection techniques have been adapted to the GEE framework. Pan (2001a I 
generalized the AIC to the GEE context based on the working independence 
assumption. Cantoni et al. ( |2005[ ) suggested selection based on adequacy of 
prediction as measured by an adapted version of Mallow's C p . In addition to 
these direct methods 
plored 



more computationally intensive methods have been ex- 
Cantoni et al. (2007) combined cross-validation with a Markov Chain 



Monte Carlo based search. Alternatively, Pan (2001b) proposed minimizing a 
bootstrap smoothed cross-validation estimate of the expected predictive bias. 

However, all of these methods lack the ability to properly deal with a large 
number of covariates. Because of the discrete nature of these selection methods, 
the resulting estimator can become instable (Breiman 1996). Moreover, com- 
plete subset comparison becomes computationally unfeasible when too many 
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covariates are present, encouraging the use of a stepwise search. The gain in 
computation time by stepwise procedures comes at the price of suboptimal pre- 
diction performance and even higher instability. 

In this paper we revisit the use of penalization within the GEE context to 
both reduce the computational burden and tackle the problem of instability. 
Indeed in ordinary regression and classification problems, penalization methods 
are well suited and often used for the task of variable selection and regulariza- 



tion. The Least Adaptive Shrinkage and Selection Operator (LASSO, Tibshi- 



ram 



1996), for example, is a penalization method which achieves both subset 



selection and parameter shrinkage. The continuous nature of the shrinkage leads 
to stable selection. The LASSO transforms the dimensionality of the subset se- 
lection problem into the selection of a single continuous tuning parameter. A 
major disadvantage of the LASSO is the potentially large bias induced by its 
shrinkage effect. The Smoothly Clipped Absolute Deviation penalty (SCAD, 
Fan and Li 2001) is an adaptation to the LASSO which avoids unnecessary bias 
by using a different rate of penalization depending on the size of the coefficients. 
Smaller coefficients are penalized in the same manner as with the LASSO, while 
larger coefficients experience approximately no influence of the penalty. 



Penalized generalized estimating equations (PGEE) were conceived by Fu 



(2003) as a framework in which these penalty methods can be applied in the 
longitudinal context. His asymptotic results were concentrated on bridge pe- 



nalization (Frank and Friedman 



1993 


Fu 


1998 


)• 


Dziak 


(2006 


) and 


Dziak and 



Li (2007) extended this approach by using the SCAD penalty function. Even 



though their simulation studies display good performance of SCAD penalization 
for binomial data, we argue in Sec tion[3]that th eir asymptotic results are limited 
to the Gaussian setting. Recently, Wang et al. ( 2011 ) have properly underpinned 
the SCAD penalized GEE with asymptotic theory for a response coming from 
the exponential family. Moreover their asymptotic results are constructed in a 
high dimensional-framework, allowing for the number of covariates p to diverge 
together with the number of clusters n. Assuming only that this divergence is 
of the same order as the increase in the number of clusters {p = 0(n)), whereas 
in other aforementioned work p is assumed fixed. 

In spite of the achievements of these authors, we believe that in mining 
for important variables in longitudinal data, two issues are commonplace, often 
overlooked and could be addressed better: multicollincarity and time-dependent 
covariates. 

In order to deal with the first issue, multicollinearity, we suggest combining a 
sparse penalty function, namely the LASSO or the SCAD with a ridge part. In 
ordinary regression the elastic net (EN, Zou and Hastie 2005) has been proposed 
as the combination of the LASSO and ridge regression. Recently the SCAD 
penalty has also been combined with a ridge part by Zeng (2009), an approach 
to which we refer hence as the SCAD £2 penalty. The inclusion of a ridge part, 
adds the grouping effect to the resulting estimator. This means that highly 
correlated variables tend to be selected or omitted as a group. 

The second issue, time-dependent covariates is often overlooked in this type 
of longitudinal analysis. Time dependence in generalized estimating equations 
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will cause bias in the regression coefficients, unless either the cross-sectionallity 



assumption is satisfied or the working independence matrix is used (Pepe and 



Anderson 


1994 


Pan et al. 2000 


Diggle et al. 


2002) 



In this paper we study EN and SCAD^ 2 penalization within the framework of 
penalized generalized estimating equations with time-dependent covariates. We 
show how these methods deal with selection under multicollinearity using both 
asymptotic theory and simulation studies. We limit ourselves to the Gaussian 
setting with a fixed number of covariates and present avenues for generalization 
to the broader exponential family. 

The remainder of the paper is organized as follows. In Section 2 we dis- 
cuss the PGEE estimator with EN or SCADi2 penalty functions. As in |Dziak] 
(2006) we establish theory by turning to the equivalent penalized generalized 
least squares problem, but in contrast to Dziak (2006) argue that this is only 
possible for the Gaussian Case. The SCAD^ penalty function is shown to be 
convex under a condition on the tuning parameters. The equivalent penalized 
least square problem together with the convexity of the penalty function allows 
us to establish the grouping effect. We demonstrate how the local quadratic ap- 
proximation algorithm (Fan and Li 2001) can be used to fit such a model. We 
also address the problem of tuning parameter selection. In Section 3 we extend 
the asymptotic results of |Dziak (2006) to EN and SCAD i2 penalty functions. 
We show selection consistency, estimation consistency and asymptotic normal- 
ity. The small sample performance is investigated through simulation studies 
in Section 4. Section 5 presents a data example and in Section 6 we discuss our 
findings. 



2. Methodology 

2.1. Penalized GEE 
2.1.1. PGEE estimators 

Consider a random sample of n subjects. Yn is the measured response for 
subject i at time t with t = l,...,Tj. X it = (Xi it ,...,X pit ) is a vector 
of p time-dependent covariates measured at the same time as the response. 
Observations within a subject are correlated, observations of different subjects 
are assumed independent. 

Without loss of generality and to facilitate the use of penalization, we assume 
the response to be scaled and the covariates to be standardized. 

The cross-sectional influence of the covariates X it on the response Y it is our 
main interest. We assume Y is generated from a distribution in the exponential 
family with E(Y"i t |X^) = g^ 1 (Xf t ) , where g is a known link-function. 



The regression coefficients (3 can be estimated by solving the PGEE (Fu 



2003): 



S p ((3) = J2 DjVr\Yi - M< ) - NP((3) = 0. (1) 



i=l 
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With n t = g-Hxl); Di = D^) = duM/dfc Vi = u] I2 W(ol)U] 12 the 
working covariance matrix; W(a) is the working correlation matrix, parameter- 
ized with parameter vector ex, t/, is a diagonal matrix with diagonal elements 
Va,r(Y it \X it ); P(f3) = dP((3)/d(3 is the vector derivative of the penalty func- 
tion. Fu (2003) proposed using the bridge penalty: P(f3) = A^|/3j| 7 with 
A > 1. The bridge penalty reduces to the LASSO when 7 = 1 and to the 
ridge penalty when 7 = 2. Both have interesting properties. The first possesses 
the sparsity property, which implies irrelevant parameters can be set to zero. 
The second yields good predictive performance under multicollinearity by the 
grouping effect, meaning coefficients of correlated covariates tend to be equal. 
When 7 is chosen between one and two, the grouping effect holds, the sparsity 
property in contrast is lost. 

We propose PGEE with a penalty function combining both the sparsity 
property and the grouping effect: 

P(J3) = X 1 P L1 (f3) + X 2 P L2 ((3). (2) 

2.1.2. Sparsity 

Pli is the part of the penalty function that provides sparsity to the resulting 
estimator. We explore two possibilities. The LASSO, which uses the Ll-norm, 



P i i(/3)=^|^|, 
3=1 



and the SCAD penalty: 



PLl((3)=J2fsCAD(\m 
3=1 

fscA D (0) = f {/ (0 < Ai) + ^l=g± Z(fl > AO j dQ, 

with a > 2, /(•) the indicator function: 1(c) = 1 if c is true and if c is false; 
x + = xl(x > 0). Both provide the the sparsity property to the proposed PGEE 
estimator, because these penalties are singular at the origin (see discussion on 
sparsity in Fan and Li, 2001). The SCAD penalty behaves like the LASSO 
for small coefficients, whereas the estimator remains approximately unbiased 
for larger coefficients, because the penalty fscAD(d) is increasing with 9 and 
bounded by a constant. 

2.1.3. Grouping effect 

Pmifl) = Y^=i 0j i s ridge part of the penalty function. This provides 
the grouping effect to the resulting estimator, meaning regression coefficients of 



highly correlated variables tend to be equal ( Zou and Hastie| [2005 ) . 

Theorem 1 (Grouping effect). If for all subjects i the observed covariate 
vector Xij = x^k , with l,k € {1, . . . ,p}, then fii — fi^, with (3 = (/3i, . . . , f3 p ) 
the solution of the PGEE £7p with EN or SCADl2 penalty. 
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The grouping effect property is a direct consequence of the convexity of the 
complete penalty function [2j this is satisfied if for the EN if A2 > 0; for the 
SCADl2 this requires A2 > 2 (a-i) • ^or a P r0 °f 01 the grouping effect we refer 
to Appendix A. 

In summary, we propose to use the EN (combination of LASSO and RIDGE) 
and the SCADl 2 (combination of SCAD and RIDGE) as the penalty in equation 
Q. 

2.1.4- Non-naive estimator 

Zou and Hastie ( 2005 ) show empirically that the estimator in equation ([I]) 



with penalty function (|2j) called the naive estimator can be improved by remov- 
ing the bias caused by the ridge part. Penalization methods aim at improving 
prediction performance at a cost of little extra bias for a lower variance. By do- 
ing simulations we also observed, that a more beneficial bias-variance trade-off 
can be realized by removing the ridge-shrinkage by multiplying with the ridge 
shrinkage factor to get the non-naive PGEE estimator: 

. =3 ■ *(l + \ 9 ) 

r^non-naive Mnaivc V 1 

We will use this non-naive versus in simulation studies hence called the PGEE 
estimator. 

2.1.5. Time dependence 

In Section 3 we will establish asymptotic properties of the PGEE. These 
results are only valid if the underlying GEE estimator is consistent. The con- 
sistency property of the GEE estimator relies on the estimating equation to 



be unbiased: E(S'(/3)) = 0. Pepe and Anderson (1994) showed that when 



time-dependent covariates are present, this is not satisfied unless either the full 
covariate conditional mean (FCCM) assumption is satisfied 

fi it = E(Y u \X it ) = E(Y it \X it X ih j £ t) (3) 

or the identity working correlation matrix is used: W{ot) = I. 

Using the working independence matrix, which we propose, is a correct in- 
ference tool to assess cross-sectional associations, if the FCCM-condition is not 
met. No additional assumptions are required besides the implied mean structure 
being correctly specified: fin = E(y^|X u) . Nonetheless efficiency can be gained 
by using all required correct lagged covariates with another working correlation 
matrix. 

2.2. Algorithm 

Equation ([T|) can be so lved with the local quadratic approximation (LQA) 



algorithm of Fan and Li (2001). We start with an initial estimate /3 
(/3i,o> • • • > Pp,o) close to the solution. We thereafter iterate the following al- 
gorithm, whereby (3 t — tj • • • > Pp,t) expresses the parameter estimate at each 
iteration step t. 
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STEP 1 (Remove small values): 

If a j3j t is close to zero (closer than a predefined threshold), we set /3j jt = 
and remove these variables from the model. 



STEP 2 (Quadratic approximation): 

We approximate the derivative of the penalty P(/3) = ( P(\/3x\), . . . , P{\fl p \ 



as: 

The penalized generalized estimating equation can hence be approximated 
by a Taylor series expansion of the GEE part and the approximate derivative 
of the penalty in Q : 

S p ((3) « S((3 t ) + (/3 - f3 t ) - NU {p t ) - iVE (J3 t ) {(5 - (3 t ) , 

with 



C/(/3 t ) = E(/3 t )/9 f . 



STEP 3 (Beta update): 
We update (3 as follows: 



Pt+i = A - { - ^ (/3,)| 1 {S(f3 t ) - NU 09 t )} , t = 0, 



(5) 



We iterate through steps 1 to 3 until convergence. Convergence is reached 
when the L2 norm of the parameter is smaller than a predefined small constant 
c: 

||0t-/3 t -i|| 2 <& 

Notice that omitting variables in this way, has a similar drawback as back- 
ward selection: when during the iteration a parameter hits zero, it can no longer 
be included into the model in further iterations. The smaller the threshold, the 
less important this effect will be. Hunter and Li (120051) propose a class of 



minorize-maximize (MM) algorithms of which the LQA algorithm is a special 
case. By using a perturbed version of the LQA, the drawback of not escaping 
zero can be avoided at the price of requiring more iterations until convergence. 

A comparison of different possible algorithms falls outside the scope of this 
paper. We have implemented the LQA algorithm for solving PGEE-problems 
in the R-software. A version can be obtained from the authors upon request. 
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2.3. Tuning parameter selection 

Using a sparse penalization technique strongly reduces the dimensionality of 
the covariate selection problem. Instead of comparing all subsets of variables, 
only a limited number of tuning parameters have to be selected. In the case 
of the EN penalty these are Ai and A2. In the case of the SCADi2 penalty an 
additional tuning parameter a is involved. However, we fix a = 3.7 as proposed 



by Fan and Li (20011. To select the remaining tuning parameters, it is useful 
to reparametrize the penalty function ^ as follows: 

P(P)=\{aP L1 {0) + (l-a)P L2 (p)}, (6) 

with a € [0,1] the control between sparsity and grouping effect and A the amount 
of penalization. 



Cross-validation (CV) as proposed in Cantoni et al. (2007) can directly be 
used to select the tuning parameters. The objective of cross-validation is to se- 
lect these parameters such that a specific loss function of prediction is minimized 
(PL), for which the prediction loss of cross-validation (PLcv) is an estimator. 

The PLcv is calculated by repeatedly splitting up the sample into a test 
and a training set, fitting a model on the training set and calculating the PL on 
the test set. We then average the results. Since we are working with repeated 
measures, the splits should be taken on the subject level rather than on the 
observation level. If the number of subjects is small, leave-one-subject-out- 



cross- validation is a suitable choice. We use the approach proposed by |Cantom 



et al. (20071: 



with y\ the vector of predicted responses of subject i in the model fitted 
without this subject, Ti is the number of observations in subject i and Vi the 
working covariance matrix. In practice the PLcv can be minimized by a grid 
search. For A it is convenient to work at a log scale. For a it is sufficient to 
take a limited number of points within the [0, 1] interval. The standard error of 
PLcv can be calculated and used to identify a set of plausible models with a 
PLcv within one standard error of the optimal model. 

When one wants to avoid the the computational burden of cross-validation, 



quasi-generalized cross validation (QGCV) can be used (Fu. 2005). As CV 



QGCV attempts to minimize a PL by using an estimator: PLqgcv, 

Wdev(\,a) 
n{l-p{X,a)/N df } 

The numerator in ([8| is the weighted deviance: 

n 

Wdev{\,a) = J2rfRr 1 r i . 
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It is based on the vector of deviance residuals Ti of subject i. The longitudi- 
nal structure is incorporated by the working correlation Ri. The denominator 

in (8 1 is a correction for model complexity, with Ndf = J^i=i r^~[ the estimated 
effective number of observations and \Ri\ — "^2 Pij the sum of the elements of 
the working correlation matrix Ri. The model complexity is incorporated by 
an estimate of the effective number of parameters in the penalty model p(X, a). 
For details see |Fu| ( |2005| . 

Notice that in Fu's approach a different prediction loss function is estimated, 
based on weighted deviance residuals instead of weighted Pearson residuals. In 
equation Q the ordinary difference between observed and predicted response 
could be replaced by the deviance residual as well, although this is not proposed 
in 



Cantoni et al. (20071 



Even though cross-validation is computational intensive, we recommend us- 
ing this approach instead of the QGCV, because the approximation of the de- 
grees of freedom might not be good when using the working independence ma- 
trix. Penalization paths provide an additional useful tool to explore variable 
importance. The combination of cross-validation and penalization paths is il- 
lustrated by a data example in Section [5j 



3. Asymptotic theory 

In this section, we establish the asymptotic properties for the naive PGEE 
estimator, when the number of subjects n goes to infinity. These theorems and 



their proofs are adopted from |Dziak ( 2006 1 . Asymptotic properties are derived 



by turning to the equivalent penalized generalized least squares (PGLS) prob- 
lem, (see Lemma[7]in Appendix B). This is only possible for the Gaussian case. 
The reasoning behind the proofs together with a list of regularity conditions 
required for these proves to hold, is provided in Appendix B. Collinearity, the 
main context in which this type of penalization is useful, exerts no influence 
on these asymptotic results, but if condition ([3| is not satisfied, the working 
independence correlation structure is a necessary condition for consistency to 
hold. 

Theorem 2 (-y/n-consistency). For the EN-penalty with Ai = O v {n~ x l 2 ) and 
A2 = O^n -1 / 2 ), or for the SCAD ^-penalty with Ai = o p (l) and A2 = O p (n -1 / 2 ), 
under normality and regularity conditions 1-4 in Appendix B. there exists a 
sequence (3 n of solutions to the PGEE equation ( ly such that (3 n — f3 
O p {n-^). 



Theorem [2] indicates the asymptotic estimation consistency of the PGEE 
estimator with EN or SCADl2 penalty when the number of subjects n goes to 
infinity. This result holds if there exists a fixed true underlying vector of regres- 
sion coefficients (3. The difference in conditions on the tuning parameters for the 
SCAD part on the one hand and the LASSO and ridge parts on the other can 
be attributed to the difference in the derivative of the penalty function for the 
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non-zero regression coefficients. In order to achieve -^/n-consistency this rate of 
penalization must fade away at a high enough rate. Since the derivative of the 
SCAD is zero for coefficients larger than aAi, a moderate decrease in penaliza- 
tion is sufficient. For asymptotic results of the SCAD penalty under a diverging 
number of parameters we refer to Wang et al. (2011). Asymptotic theory in 



the broader context of penalized estimating function has also been described 
by Johnson et al. (20081, who included results on the EN as an approximate 



zero-crossing. 

The aim of our proposed PGEE estimator is not only to consistently estimate 
the parameters of a regression model, but also to select the variables for that 
model. In order to get more insight in the selection properties of this estimator, 
we partition our vector of regression coefficients into two subgroups: 

(3 = {[3 A ,f3 M ), 

with A = {j : j3j 7^ 0} the vector of indicators of non-zero coefficients, which 
we want to retain in our model; with J\f — {j : f3j = 0} the vector of indicators 
of zero coefficients, which we want to omit from our model. We will establish 
selection consistency in two steps. First we show that all active coefficients f3 A 
are retained in Lemma [3] Second we establish conditions under which the zero 
coefficients (3^ will be dropped in Lemma [4] 

Lemma 3 (Sensitivity). Under the conditions in Theorem^ there exists a 
sequence f3 n such that the active coefficients are included in the model with 
probability approaching one, i.e., Pr(3j G A : ft = 0) = o(l) 

This follows directly from Theorem [2j We assume that f3j is fixed. And take 
some small e > 0. Then: 



Pr(3j € A : ft = 0) < Pr(3j E A 



ft " ft 



> e) < Pr( 



(3-f3 



> e 2 ) = (l). 



For the asymptotic estimator to be sparse, we need an additional condition 
on the strength of tuning parameter Ai: 



Nn~ 1/2 \ 1 oo, 



(9) 



as n — > oo. 

Let us assume the number of observations N grows proportionally with the 
number of subjects n. It follows that Q is in conflict with the conditions of 
Theorem [2] for the EN, but not for the SCAD L2 . 

Lemma 4 (Sparsity). For the SCAD^ penalty, under conditions in Theo- 
rem^and condition there exists a sequence n of solutions such that (3 ^ is 
y/n- consistent for [3^, and that (3^- = f3j^ = 0. 

Theorem 5 (Selection consistency). Under the conditions of Theorem^and 
condition the SCAD^ is asymptotically able to include the active set (Pr(3j £ 
A : ft = 0) = o(l)) and to omit the non active set {(3^- — f3j^ — 0) . 
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This follows from Lemma [3] and Lemma [4j 



Theorem 6 (Asymptotic normality). Under the conditions of Theorem^ 
there exist a sequence (3 of solutions to equation such that: 



0, 



iV(0,*). 



(10) 



For details on <fr, see Dziak (20061 



We notice this asymptotic normality proof allows us to derive standard er- 
rors, and therefore to calculate p- values of Wald tests for parameter coefficients. 
However, no standard errors are provided for the parameters put to zero. There- 
fore we recommend using the bootstrap to obtain standard errors for all esti- 
mates. 



4. Simulation studies 

4--1- Aims and setting 

We compare the effect of penalization on selection and prediction perfor- 
mance on simulated longitudinal data with multicollinearity and time-dependent 
covariates, using different penalty functions (LASSO, RIDGE, EN, SCAD and 
SCAD^)- Two main settings will be investigated: 



(1) In Section 4.2 we simulate from a cross-sectional longitudinal process 
and fit a PGEE model with the same time-dependent covariates. In this setting 
the FCCM-assumption (see equation (l3l) is satisfied. 



(2) In Section 4.3 we simulate from a process which includes lagged covari- 
ates. The PGEE estimator is used to assess the implied cross-sectional associa- 
tions between response and covariates. In both cases the working independence 
matrix will be used. However this requirement could be relaxed in the first 
setting. 

In each case the selection performance will be expressed as the percentage of 
correct and incorrect deletions. The prediction performance will be measured 



by the model error (ME, |Fan and Li] |2001[ ) 



ME(/i) = (/3-/3) T E(AA T )(/3-/3). (11) 

The model error is the part of the prediction error which can be influenced 
by the quality of the estimator fi. For a perfect estimator, this quantity is zero. 
The tuning parameters for the different penalty models will be selected by leave 
one subject out cross-validation. 

4-2. Cross-sectional model 

We simulated longitudinal datasets with n = 20 subjects, Tj = T = 5 
observations per subject from the following longitudinal process: 

Y it = Xj t p + e it . 
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Table 1: Cross-sectionality assumption satisfied. Comparison prediction error via the ME, 
with Standard error (S.E.) and selection performance via ratio of correct deletions (CD.) and 
ratio of Incorrect deletions (I.D.) for 100 simulations. Tuning parameters are selected using 
cross-validation, the median tuning parameters are displayed. 



Penalty 


ME (S.E.) 


(A, a) 


CD. 


I.D. 


GEE 


0.098 (0.006) 




0.000 


0.000 


LASSO 


0.095 (0.006) 


(0.041; 1.000) 


0.347 


0.000 


RIDGE 


0.096 (0.005) 


(0.010; 0.000) 


0.013 


0.000 


EN 


0.095 (0.005) 


(0.041; 0.500) 


0.333 


0.000 


SCAD 


0.084 (0.006) 


(0.079; 1.000) 


0.630 


0.000 


SCAD L2 


0.081 (0.006) 


(0.170; 0.643) 


0.650 


0.002 



Yit is the response for subject i at time t, f3 = ( — 1, — 1, 1, 1, 0.5, 0, 0, 0) T ; ej = 
(en, . . . ,eiT) T multivariate normal with a first order autoregressive correlation 
structure with p = corr^e^jt, ej^t+i) = 0.7. Xu ~ Ng(Q, S); £ is the correlation 
structure of the covariates at each time point. We use the following structure: 
corr(A 1; A 2 ) = 0.6; corr(AT 3 , AT4) = 0.3. All other correlations between X- 
variables are set to zero. 

Table [T] illustrates the impact of penalization on prediction and selection 
performance. We notice only a small improvement in the ME by ridge, EN 
or LASSO penalization. The selection performance of the LASSO is slightly 
better than the EN, but still only removes about 35% of variables which should 
be removed. 

The SCAD in contrast strongly reduces the ME and achieves a good se- 
lection, considering the small sample size. With the inclusion of an additional 
ridge component, the SCAD/, 2 accomplishes even a further reduction in the ME 



and a better selection, although the convexity condition ((A. 2) in Appendix A) 
is not satisfied. Also in other settings, we have generally observed SCAD i2 pe- 
nalization to outperform the ordinary SCAD when collinearity is present, even 



if condition (A.2) is not fulfilled. 

The large difference in the ME and selection performance for the EN and 
the LASSO on the one hand and the SCAD£ 2 and the SCAD on the other, can 
likely be attributed to the combination of estimation and selection consistency 
(see Theorem [2] and [5]) of the SCADr, 2 and the SCAD, which is lacking for the 
EN and the LASSO. 

4-.S. Implied cross-sectional model 
4. 3.1. Data generation 

We make two adaptations to the simulation study in Diggle et al. (2002). 
First we increase the number of time-dependent covariates to twenty. Second 
multicollinearity is introduced by imposing a correlation structure between the 
covariates at each time point. The data is simulated from following process: 

{Y u = Xf tll + Xj t _ ll2 + b t + e it , 
[Xj,it = PjXj t a-i + ejst, Vj = 1, . . . , 20. 
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Both response and covariates have a time-dependent structure. Where i is 
the indicator of the subject, t is the time indicator, and j indicates the covariate. 
Here e lt ~ N(0, 1), h ~ N(0, 1), e M ~ N(0, 1), Cj -, it ~ N(0, 1 - pf ) for i > 0, all 
mutually independent. Jf it = (A ljit , . . . , X 2 o,it) ~ N 2 o(0, E). The covariance 
structure E = diag^Ei, Ei, Sj, In) has a block diagonal structure, with 



Ei = 



1 

0.3 
0.5 



0.2 0.5 
1 0.4 
0.4 1 



Notice that we consider two different types of dependence between the X- 
variables. On the one hand, each covariate for each subject Xj t a follows its 
own autoregressive time evolution. On the other hand, within each subject, at 
each time point different covariates have a cross-sectional correlation character- 
ized by their covariance structure E. 
We look at two distinct scenarios: 

• Scenario 1: 

f 7i = 7 2 = ((2, 2, 2), (1, 1, 1), (0.1, 0.1, 0.1), 0,0,...), 
\ Pj =0.5,Vj = 1,...,20. 

• Scenario 2: 

(7i =7 2 = ((1,1,1), (1,1,1), (0,0,0), 0,0,...), 
[p = (0.3,0.3,0.3,0.6,0.6,0.6,0.5,0.5,...). 

With p = {p x , . . .,P2o)- 

For each scenario, we simulate 100 datasets with 20 subjects and 100 datasets 
with 100 subjects. 



4-3.2. Cross-sectional model 

Suppose the objective is to assess the cross-sectional associations between Y it 
and Xi t . We therefore estimate the parameters of the implied cross-sectional 
model: 

E(Y u \X it ) = Xj t f3. 

The properties of the multivariate normal distribution allow us to calculate the 
coefficients (3 of this implied model: 

P = 1\ +P T 7i- 

4-3.3. Simulation results 

Table [2] summarizes different PGEE fits on the 100 simulated datasets for 
each scenario. SCAD^ 2 penalization yields for all scenarios the best prediction 
performance of all PGEE estimators. The model error is strongly reduced by 
using a sparse penalization technique (LASSO, EN, SCAD, SCADi 2 ), especially 
when only 20 subjects are available. Ridge penalization in contrast improves 
the model error only modestly. For scenario 2 with 20 subjects, we see that the 
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EN has a smaller model error than the SCAD. The beneficial grouping effect 
more than compensates for the lack of unbiasedness of the EN in this example. 
In general we observe the SCAD^2 to outperform the SCAD and the EN when 
strong multicollinearity is present. 

In addition the selection performance of the SCAD^ 2 is superior to that 
of the EN or the LASSO. However, a trade-off between correct deletions and 
incorrect deletions is observed. In Table [2] we see that ordinary GEE and ridge 
penalization have sometimes a small percentage of deletions. This is due to the 
algorithm (see Section 2.2), which includes a threshold term. 

When we look at the relative bias, we observe the overall sample bias of the 
first beta coefficient of the SCAD and SCAD^, not to be larger than that of 
ordinary GEE. For the LASSO and the EN we notice a substantial negative 
bias. 



4- 3. 4- Extension to binomial data 

If we apply the logit transformation to the mean structure of these two sce- 
narios, we can use this quantity as a probability to simulate from the Bernoulli 
distribution. We simulate data from following longitudinal process, which is a 
generalized linear mixed model: 

Xj,u = PjXj.u-i + tj.it, Vj = 1, . . . , 20, 
< logit(p it ) = Xf t j 1 + X£_ X 7 2 + b h 
Y it <~ Bern{p it ), 

where i indicates the subject, t is the time indicator and j indicates the covariate. 
The response Yu is generated from the Bernoulli distribution, where the success 
probability pa is linked by the logit transform to a linear mean structure, with a 
subject-specific effect hi ~ N(0, 1). The covariate process is autoregressive, with 
e l0 ~ N(0, 1), Xu = {Xi tit , . . . ,X 2 o,u) ~ N 20 (0, E). The same scenarios as for 
the Gaussian case were used. For details we refer to the web based supplement. 

The results of the PGEE estimator based on the binomial distribution are 
very similar to the ones we have displayed for the Gaussian case, indicating 
the suitability of these penalties outside the Gaussian context. However the 
improvement in the model error, even under optimal tuning are modest in com- 
parison to the Gaussian case. For scenario 1 (100 subjects) the ordinary GEE 
fit displayed a model error of 0.0877 whereas the best penalization method, the 
SCAD^2 had a model error of 0.0833. For more details on simulation results in 
the non-Gaussian setting we refer to the supplementary material to this paper. 



5. Determinants of female life expectancy in Europe 

In health economics, identifying the most influential determinants of health- 
outcomes is a major research topic. If observations are taken over time, and 
the number of potential determinants is large, our proposed PGEE estimator 



can facilitate the selection of the most important ones. Beutels et al. (7-10 
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Table 2: Cross-sectionality assumption not satisfied. Comparison of the different penalty 
functions on the basis of prediction performance (model error), selection performance (ratio 
of correct deletions (CD.) and ratio incorrect deletions(I.D)) and relative bias of the first coef- 
ficient. Tuning parameters are selected using cross-validation, the median tuning parameters 
arc displayed. 



Scenario 1, 20 subjects 

Penalty ME (S.E.) (A, a) CD. LP. Rcl. Bias 

GEE 6.489 (0.263) - 0.002 0.000 -0.103 

LASSO 4.259 (0.215) (0.405; 1.000) 0.440 0.169 -0.158 

RIDGE 5.593 (0.204) (0.092; 0.000) 0.002 0.001 -0.182 

EN 4.144 (0.193) (0.405; 0.821) 0.392 0.144 -0.154 

SCAD 4.541 (0.287) (0.425; 1.000) 0.549 0.216 -0.089 

SCAD L2 3.744 (0.198) (0.518; 0.786) 0.500 0.184 -0.088 

Scenario 1, 100 subjects 

GEE 0.942 (0.027) - O004 (1001 4X021 

LASSO 0.663 (0.029) (0.151: 1.000) 0.428 0.099 -0.047 

RIDGE 0.934 (0.029) (0.013; 0.000) 0.011 0.001 -0.036 

EN 0.681 (0.030) (0.193; 0.929) 0.417 0.097 -0.048 

SCAD 0.445 (0.026) (0.287; 1.000) 0.657 0.154 -0.015 

SCAD L2 0.438 (0.025) (0.349; 0.857) 0.662 0.158 -0.019 

Scenario 2, 20 subjects 

GEE 3.018 (0.115) - O003 (1000 4L058 

LASSO 1.988 (0.100) (0.247; 1.000) 0.446 0.003 -0.274 

RIDGE 2.463 (0.086) (0.092; 0.000) 0.006 0.000 -0.246 

EN 1.884 (0.086) (0.316; 0.643) 0.383 0.002 -0.261 

SCAD 2.280 (0.147) (0.287; 1.000) 0.494 0.023 -0.206 

SCAD L2 1.626 (0.102) (0.425; 0.643) 0.431 0.010 -0.202 

Scenario 2, 100 subjects 

GEE 0.456 (0.014) - O003 (1000 4L020 

LASSO 0.309 (0.014) (0.118; 1.000) 0.451 0.000 -0.080 

RIDGE 0.449 (0.015) (0.021; 0.000) 0.006 0.000 -0.046 

EN 0.317 (0.015) (0.151; 0.929) 0.441 0.000 -0.086 

SCAD 0.161 (0.010) (0.235; 1.000) 0.736 0.000 -0.011 

SCAD L2 0.150 (0.009) (0.287; 0.857) 0.722 0.000 -0.029 
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May 2011 ) for instance try to search for determinants of antibiotic consumption 



between European countries and over time. We will use a subset of this original 
dataset to search for determinants of female life expectancy. All available years 
between 1999 and 2005 are used in our analysis, for the following 13 countries: 
Austria, Belgium, Denmark, Estonia, Finland, France, Germany, Ireland, Italy, 
the Netherlands, Spain, Sweden and the United Kingdom. The data availability 
is summarized in Table C.5 in the Appendix. 

The objective of this study is to find the most important determinants, which 
could explain observed differences in female life expectancy at birth between 
countries and over time. A set of 42 potential determinants, having a potential 
direct or indirect effect on life expectancy, are included. These determinants 
can be roughly structured into six groups: death rates such as the death rate 
due to pneumonia; socio- economics such as GDP per capita and hours in a work 
week; culture such as the four dimensions of Hofstcde: Power distance index, 
Individualism, Masculinity and Uncertainty avoidance; factors that character- 
ize infectious disease preventions such vaccination coverage against pertussis; 
organization of the health care system such as hospital beds and demographics 
such as population density. 

These different covariates are correlated, and vary over time. The ordinary 
GEE estimator is unable to fit a full model, because of the limited number of 
data points (44) compared to the number of covariates (42). We assess the 
cross-sectional associations between response and covariates using the PGEE 
estimator. Both covariates and response are standardized prior to analysis. We 
employ PGEE with the penalty function parametrized as in equation ([6]). Leave- 
one-subject-out-cross-validation is then performed on a grid of a and A values. 
For a the two extremes and 1 are included. On this grid of tuning parameters 
the minimum of the PLcv is selected for both the EN and the SCADl 2 . For the 
EN, we find a PL C v = 0.0348 with a standard error of 0.0035. The SCAD L2 
outperforms the EN with a PLcy = 0.025, with a standard error of 0.0020. For 
the EN, this minimum is found for the tuning parameters (a; A) = (2.857E — 
4);3.393E - 4). The SCAD L2 reaches its optimum at (a; A) = (0.786; 0.0146). 
In both cases this minimum was significantly different from the two extreme a 
values. In Table [3] and [4] the standardized coefficients of both the EN and the 
SCAD L2 are reported. The EN selects 27 out of the 42, whereas the SCAD L2 
reduces this number further to 20. The order of the standardized coefficients 
differs only slightly between EN and SCAD^ 2 . 

For EN penalization, a shrinkage plot gives some insight into the impact of 
the different variables on the response. Figure [T] displays the shrinkage paths of 
the ten largest standardized coefficients at the selected fit. We keep a fixed and 
observe the evolution of standardized coefficients if A decreases. This shrinkage 
plot indicates the sensitivity of the fit with respect to A. If A increases we see 
that some variables are excluded from the model such as the percentage children 
vaccinated against pertussis, whilst others increase their influence such as the 
infants death. 

The grouping effect implies that correlated variables are pulled towards each 
other with increasing penalization. The coefficients of death rates due to is- 
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1e+00 1e-01 1e-02 1e-03 

lambda 



Figure 1: EN plot: Standardized coefficients for the ten largest coefficients at optimal (a, A) 
combination, as a function of lambda. Lambda is indicated in reverse log-scale. The coeffi- 
cients are numbered by their ordering of the EN-fit as in table [3] 
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chaemic heart disease and of death rate due to chronic illness, for instance, 
seem to converge when more penalization is performed. Our simulation stud- 
ies have shown that this property results in better estimation of the regression 
parameters under multicollinearity, however no clear grouping structure can be 
identified using these methods. The shrinkage paths can at most be indicative 
for any grouping structure in the covariates. 

The PGEE fit provided a selection of all main determinants of female life 
expectancy in Europe, even in case of high correlation amongst the different 
covariates. In this manner, six specific death rates (due to ischaemic hearth 
disease; chronic diseases, pneumonia; cancer; bronchitis, asthma or emphysema; 
overall infant mortality) turned out to be the 'big fish' (Zou and Hastie 20051 
correlated and all predictive for lower average female life expectancy. Besides 
these identified death rates vaccination against pertussis, working hours per 
week, level of education, relative humidity and birth rate were all predictive 
for higher female life expectancy, while alcohol consumption and the percentage 
of people who believe that most people can be trusted predicted lower average 
female life expectancy. Penalized estimating equations can generally be applied 
to identify determinants for health outcomes as illustrated by this case study. 



6. Discussion 

In this paper we addressed the problem of variable selection in longitudi- 
nal data under multicollinearity and time-dependence. We proposed using the 
PGEE estimator under working independence to identify the most important 
cross-sectional associations in this context. In order to deal with potential mul- 
ticollinearity, the use of a penalty function combining both the sparsity and 
grouping effect was proposed, namely the EN or the SCADi2- We proved the 
grouping effect as a consequence of the convexity of the penalty function and 
established the conditions for the SCAD^2 to be convex. An additional con- 
sequence of this convexity is the avoidance of the multiple roots problem in 



penalized GEE with the SCAD penalty, mentioned in Wang et al. (2011 1. 

Asymptotic theory reveals that both the EN and the SCADi2 penalty func- 
tions can achieve a consistent fit, however only the SCAD^ has the addi- 
tional property of selection consistency. This is also reflected in our simula- 
tion studies were the SCAD^ outperformed all other methods (ordinary GEE, 
LASSO, ridge, EN, SCAD) with respect to prediction performance. Although 
the SCADl2 displayed overall better prediction performance this might not al- 
ways be the case. In practice we recommend exploring variable importance 
using EN-paths and base final inference on the model selected (EN or SCAD^ 2 ) 
through on cross-validation. 

Combining penalty functions has proven beneficial for selection and predic- 
tion, however this requires the selection of additional tuning parameters. In our 
data example we used cross-validation to select these parameters, which is com- 
putationally intensive. Establishing appropriate information criteria is crucial 
to the further development of penalty methods within GEE. 
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Adaptive versions of the LASSO and the EN (Zou 2006 Zou and Zhang 



20091 have not been considered as alternatives to the SCAD and the SCAD^ 2 
in this paper. In ordinary regression these penalty methods achieve selection 
consistency by using adaptive weights to different coefficients within the penalty 
function. These weights are the inverse of an initial estimate. In this manner 
larger initial coefficients are penalized less. The asymptotic theory in this paper 
does not incorporate adaptive versions of the penalty functions with weights 
dependent on the data at hand. Moreover in the presence of a large number 
of covariates, getting a good initial estimate is not feasible. A small simula- 
tion study (not included) suggest that the performance of the PGEE with the 
adaptive EN is only slightly better then the PGEE with the EN penalty, when 
a small number of potential covariates is present. We have also explored an 
adaptive SCADi2, but this performance was generally worse than the ordinary 
SCAD L2 . 

In this paper, we limited ourselves to look for cross-sectional associations in 
the Gaussian setting with an exploration of the binomial case. The develop- 
ment of asymptotic theory for the non-Gaussian setting and adaptations to the 
penalty function to incorporate lagged covariates are topics for future research. 
Asymptotic theory developed in Dziak ( 2006 1 and extended in this work cannot 
be employed in the non-Gaussian context because of the absence of an objec- 
tive function. Asymptotic theory in the broader context of penalized estimating 
functions has also been described by Johnson et al. ( 2008 ) . This approach might 
be fruitful to derive asymptotic properties for the SCADl 2 . 
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Appendix A. Proof of the grouping effect 

We consider the PGEE as a penalized generalized least squares problem 
(PGLS), (see Lemma W\. It is then sufficient to show that the EN and SCADl 2 
penalty functions are strictly convex in /3. This is satisfied for the EN if A 2 > 0. 
For the SCAD^ 2 this is satisfied if A 2 > 2 (a-i) ( see Lemma JsJ) . 

Thus choosing A 2 > 2 (a-i) Y ie ^ s the desired grouping effect for the SCAD £ 2 
PGEE estimator. Notice that the grouping effect also holds for the non-Gaussian 
case with an identity working correlation matrix, because in that case, solving 
the PGEE is equivalent to optimizing a penalized likelihood. And the grouping 
effect can be proved using the same logic. 

Lemma 7 (PGEE as penalized generalized least squares). In the Gaus- 
sian case, solving the PGEE equations is equivalent to minimizing a penalized 
generalized least squares problem. 
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Table 3: Table with standardized coefficients for the PGEE estimator with both EN and 
SCAD/,2 penalty function. Both response and covariates were standardized before applying 
penalization. 

ENfit SCAD L2 fit 



Variable 


Order 


Stand Coeff 


Order 


Stand 


Death rate due to ischacmic 


1 


-0.343 


1 


-0.343 


heart disease 










Death rate due to chronic dis- 


2 


-0.320 


3 


-0.259 


eases 










Death rate due to pneumonia 


3 


-0.276 


2 


-0.284 


Death rate due to cancer 


4 


-0.149 


5 


-0.144 


Death rate due to bronchitis 


5 


-0.148 


8 


-0.130 


asthma & emphysema 










Infant deaths per 1000 live births 


6 


-0.126 


9 


-0.128 


Percentage of urban population 


7 


-0.125 


4 


-0.166 


Percentage of infants vaccinated 


8 


0.112 


13 


0.117 


against pertussis 










Uncertainty Avoidance Index 


9 


0.102 


7 


0.136 


Death rate of AIDS 


10 


0.098 


6 


0.140 


Relative humidity (average dew 


11 


0.094 


16 


0.094 


point in degrees Celsius) 










Hours worked per week of full 


12 


0.084 


14 


0.097 


time employment 










Expectancy of educational level 


13 


0.080 


10 


0.123 


in years 










Percentage of people who believe 


14 


0.076 


17 


0.090 


that most people should not be 










trusted 










How strongly people find having 


15 


-0.071 


12 


-0.118 


experts, not government, make 










decisions for a country a bad 










thing 










Birth rate 


16 


0.065 


11 


0.122 


Pure alcohol consumption, liters 


17 


-0.057 


15 


-0.096 



Coeff 



per capita 
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Table 4: Table with standardized coefficients for the PGEE estimator with both EN and 
SCAD^2 penalty function (continued). Both response and covariates were standardized before 
applying penalization. 



Variable 




EN fit 


SCAD L2 fit 


Order 


Stand Cocff 


Order 


Stand Cocff 


Women to men ratio 


23 


0.016 


97 


n 


Private honseh olds' ont-of- 


24 


0.012 


36 


o 


TiopKot na vm cut on hca 1 1 n a s % 

yj VJ IvV_ Lr I J Ci y 1111_>11 U l_/ll llv^Ciil Lilt Ci O / U 










of total health expenditure 










% of people attaining the edu- 


25 


-0.006 


28 


o 


cational level of upper secondary 










school 










Practicinc nhvsicians oer 1 00 000 


26 


-0.004 


34 


o 


Individualism 


27 


0.000 


22 


o 


Alasculinity 


28 





21 


o 


Power Distance Index 


29 


o 


AO 


n 

VJ 


Total health expenditure onr- 


30 


o 


24 


o 


chasing power parity in dollar 










per capita 










Average number of people per 


31 





25 


o 


room in an occupied housing unit 










P)eath rate due to accidents 

Jv v> Cl> U 11 1 <_Ij L vl l_lv> UW CLv_'\_'l*_i.\_'li. U O 


32 


o 


26 


n 


Standard deviation of absolute 


33 


o 


29 


n 


hinniditv 

11 IX111H_X1 U V 










How stronP"lv resnect for a an- 

X Ivy VV ull UlltLl y IL/OUOvt iUl (X< Cat Lt 


34 


o 


30 


o 

VJ 


th or i tv i s nerci eved as a bad 










thing 










How strongly atheist versus reli- 


35 





31 





gious people describe themselves 










Poverty rate 


36 





32 





Average Population Density per 


37 





33 





km 2 










Public sector expenditure on 


38 





35 





health as % of total government 










expenditure. 










Hospital beds per 100 000 inhab- 


39 





37 





itants 










Death rate due to chronic liver 


40 





40 





disease 










Death rate due to diabetes Mel- 


41 





41 





litis 










Death rate due to alcohol abuse 


42 





42 






21 



Proof. Consider the following penalized generalized least squares problem: 

Q p (0) =Q{p)+NP{p), (A.l) 

with: 

Q09) = ^-S^K^S, 
An 

~' IdS (/3) 



1 1 n 5/3 



This is only valid for the Gaussian case, because otherwise D , is a function 
of /3. 

Solving the PGEE in equation Q is hence equivalent to minimizing the 
objective function Q p (/3) in equation (|A.l|). 



Proof (Theorem [TJ. Suppose fa ^ /3 fc , let us consider j3* as: 

f j3j = fa, if j ^ k and j ^ I 

\p* = \{p k + fa), ii j = k or j = l 

Suppose Xi t i — Xi t k for all subjects i, then Q(p) = Q((3*), because Xj/3 = 
Xi/3* = £>i,V subjects i, with X% — (x itl , . . . ,x iyP ). Since P(/3) is strictly 
convex, P(f3) > P((3*). Therefore j3 cannot be the minimizer of Q p (f3), which 
is a contradiction. Therefore fa = $k- 

Lemma 8 (Convexity of the penalty function). If A2 > 9( a _x) > theSCAD^- 
penalty is strictly convex. 



Proof. Given 

< Ax) + 7' 1 "( + I(0 > Ax) } old + X 2 2 
[a — 1)a 

we calculate the second derivative of P{9): 



P{6) =\J [l{6 < Ax) + ( " Al ' 7 + * ■ 1 - * ^ 



\ 2 , if ^ < Ax, 

H«i = { ^i + 2A 2) ifAi<0<aAx, 
\ 2 , if aAx < 6. 

This second derivative is always positive if: 



A2 > 2^1)- ^ 



Under condition (A.2) the SCAD^2 penalty function is strictly convex in the 
parameter vector /3. 
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Appendix B. Background Asymptotic theory 



For the asymptotic properties to hold, following regularity conditions are 
required, which are taken over from Dziak (2006). 

Regularity condition 1: S((3) and K(f3) = \ £^ =1 D I V i~ lD i have con- 
tinuous third derivatives in (3. 

Regularity condition 2: K(f3) is positive definite with probability ap- 
proaching one. there exist a non-random function Kq ((3) such that \\K((3) — Kq{/3)\\ 
uniformly, and Kq((3) > for all (3. 

Regularity condition 3: The Si = DfV~ 1 (Yi — fi i ) have finite covariance 
for all (3. 

Regularity condition 4: The derivatives of Kq((3) in (3 are O p (l) for all 

(3. 



Proof (Theorem 2: ^-consistency) . For < e < 1, it is sufficient to 
show that with probability at least 1 — e, 3 some large constant C e , such that a lo- 



cal solution f3 n of Q p ((3) exist in the interior of the ball {/3 - 



such that 



Dziak 



$ n — (3 = Opin- 1 / 2 ) will be Op^" 1 / 2 ), (Fan and Li 
2006). The conditions on the tuning parameters make sure, that 
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asymptotically dominates the penalty part. 



PROOF (Lemma 2: Sparsity). Condition ^ provides that in a neighborhood 
where the non-active coefficients are close tho zero, the sparse part has asymp- 
totically still enough influence to put the estimates of the non-active coefficients 
exactly equal to zero: 



Q P (f3 A , 0) = argmm {Q P ((3 A , (3 B )} 
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Table C.5: Table of availability of all covariates and response for the different Country- Year 
combinations, 1 indicates available, non available 



Country 


1999 


2000 


2001 


2002 


2003 


2004 


2005 


Austria 


1 


1 


1 





1 








Belgium 


1 




















Denmark 








1 














Estonia 














1 


1 





Finland 


1 


1 


1 


1 





1 


1 


France 


1 


1 


1 


1 











Germany 


1 


1 


1 














Ireland 


1 


1 


1 





1 


1 





Italy 


1 


1 


1 














Netherlands 


1 


1 


1 





1 








Spain 


1 


1 


1 


1 


1 








Sweden 


1 








1 











United Kingdom 


1 


1 


1 


1 
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